[WIP] Flux simulator #2561
Yay! Take a victory lap! A few initial thoughts/questions:
Let's peel off the Python refcounting fix(es) to a standalone PR and get that in ASAP, even if it's not the final fix, so we don't need to carry it here and elsewhere.
Could we improve on
Rather than calling out the job record as "sacct format", should we define a Flux format that is either compatible or that has a straightforward conversion path?
The
Is the strategy to have simulator.py unload the exec module and register a handler to replace it? Maybe we could find a better way to do that and then avoid the need to do module management from Python. Maybe using the
The big ticket item of course: should we revisit, now that we've had some experience building on our original job manager design, whether there are alternatives to the "quiescent" interface? It would be nice if any new synchronization mechanisms we introduce have some general utility beyond this use case. It would also be nice to have less intrusion into the scheduler. This is hard as I recall, so I'm not sure we'll get anywhere, but I'd feel better if we spent a bit more time thinking about it before committing to this approach.
IMHO, peeling off some of the bits mentioned above into standalone PRs would help move this forward. Anyway, nice job getting this all wrapped up :-)
converts flux ids from/to hex, dec, and kvs
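As a rough illustration of the kind of conversion this commit describes, here is a minimal standalone sketch. It does not use the Flux Python bindings; the function names are hypothetical, and the dotted four-word hex layout and the `job.` KVS prefix are assumptions about the representation rather than something stated in this PR.

```python
def id_to_dothex(jobid: int) -> str:
    """Render a 64-bit job id as four 16-bit hex words (assumed layout)."""
    if not 0 <= jobid < 2**64:
        raise ValueError("job id must fit in an unsigned 64-bit integer")
    words = [(jobid >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
    return ".".join(f"{word:04x}" for word in words)


def dothex_to_id(text: str) -> int:
    """Parse the dotted hex form back into an integer job id."""
    words = text.split(".")
    if len(words) != 4:
        raise ValueError(f"expected four hex words, got {text!r}")
    jobid = 0
    for word in words:
        jobid = (jobid << 16) | int(word, 16)
    return jobid


def id_to_kvs(jobid: int) -> str:
    """Build a KVS-style path; the 'job.' prefix is an assumption."""
    return "job." + id_to_dothex(jobid)


def kvs_to_id(path: str) -> int:
    """Recover the integer job id from a KVS-style path."""
    prefix = "job."
    if not path.startswith(prefix):
        raise ValueError(f"not a job path: {path!r}")
    return dothex_to_id(path[len(prefix):])
```

Keeping the decimal integer as the canonical form and deriving the other representations from it means every conversion pair round-trips through a single code path.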
Add optional callbacks to notify schedutil users when there are no longer any outstanding futures/messages in the schedutil context (i.e., idle) and when the schedutil context goes from idle to busy (i.e., it now has an outstanding future/message). This is useful for simulations where the scheduler needs to respond accurately to a `quiescent` request from the job-manager.
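The idle/busy notification described in this commit can be sketched in Python as a counter with two optional callbacks. This is only an illustration of the idea; the actual change lives in the C schedutil library, and the class and method names below are hypothetical.

```python
from typing import Callable, Optional


class OutstandingTracker:
    """Counts outstanding futures/messages and reports idle/busy transitions."""

    def __init__(
        self,
        on_idle: Optional[Callable[[], None]] = None,
        on_busy: Optional[Callable[[], None]] = None,
    ) -> None:
        self._outstanding = 0
        self._on_idle = on_idle
        self._on_busy = on_busy

    @property
    def idle(self) -> bool:
        return self._outstanding == 0

    def add(self) -> None:
        """Record a new outstanding future; fire on_busy on the 0 -> 1 edge."""
        self._outstanding += 1
        if self._outstanding == 1 and self._on_busy is not None:
            self._on_busy()

    def remove(self) -> None:
        """Record a completed future; fire on_idle on the 1 -> 0 edge."""
        if self._outstanding == 0:
            raise RuntimeError("no outstanding futures to remove")
        self._outstanding -= 1
        if self._outstanding == 0 and self._on_idle is not None:
            self._on_idle()
```

The callbacks fire only on transitions, not on every add or remove, so a user of the tracker is notified once per idle or busy period.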
The simulator can now send a `job-manager.quiescent` request, which will only be responded to when the entire system has quiesced (i.e., in the absence of new events/requests, the system will make no further changes - such as allocating or freeing jobs). For the simple scheduler, this simply means that the schedutil library is idle. The job-manager then sends its own `quiescent` request to the scheduler along with every alloc request. It will only respond to the simulator's request after its own request to the scheduler is responded to. In the future, this protocol will be expanded to include the exec and depend modules.
After receiving an alloc response from the scheduler, the job-manager emits an event, which triggers a `start` request to be sent to the exec system. The re-entrance into the reactor loop between the reception of the alloc response and sending the start request means that the job-manager has a chance to prematurely process the quiescent response from the scheduler. This ultimately leads to the simulator receiving an erroneous `quiescent` response from the job-manager. A similar problem exists for outstanding start requests. To solve these problems, ensure that every alloc response has a corresponding start response before sending a quiescent request. Track the number of outstanding requests in the simulator context of the job manager, which is also the piece responsible for responding to the quiescent request.
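The bookkeeping described in this commit can be sketched roughly as follows. This is a simplified Python model, not the job-manager's C code: the class, its method names, and the single `respond` callback are hypothetical, and it assumes one pending simulator request at a time.

```python
from typing import Callable


class QuiescenceTracker:
    """Defers the quiescent response until no alloc response lacks a start response."""

    def __init__(self, respond: Callable[[], None]) -> None:
        self._respond = respond
        self._request_pending = False   # simulator is waiting for a response
        self._sched_quiescent = True    # scheduler answered its quiescent request
        self._awaiting_start = 0        # alloc responses without a start response

    def quiescent_request(self) -> None:
        self._request_pending = True
        self._maybe_respond()

    def alloc_request_sent(self) -> None:
        # A quiescent request accompanies every alloc request to the scheduler.
        self._sched_quiescent = False

    def alloc_response_received(self) -> None:
        # A start request will follow, possibly after re-entering the reactor.
        self._awaiting_start += 1

    def scheduler_quiescent(self) -> None:
        self._sched_quiescent = True
        self._maybe_respond()

    def start_response_received(self) -> None:
        self._awaiting_start -= 1
        self._maybe_respond()

    def _maybe_respond(self) -> None:
        if self._request_pending and self._sched_quiescent and self._awaiting_start == 0:
            self._request_pending = False
            self._respond()
```

Counting at the moment the alloc response arrives, rather than when the start request is sent, is what closes the window in which the scheduler's quiescent response could be processed too early.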
ac28a43 to ce68813
👍 Done.
Yeah, that makes sense. I was waffling between the two solutions and went the Python route b/c it was expedient at the time, but exporting it from C seems cleaner.
Using a format other than "sacct" seems like a good idea to me. One option is the Parallel Workloads Archive's "Standard Workload Format" (SWF). That is the closest thing to a common standard in the literature, although it is a bit outdated at this point. Another option would be what you suggest: put together our own format, maybe one that natively supports Jobspec. That way it is easy to run simulations involving resources beyond nodes and cores. I think I'm leaning towards the latter since we plan on doing BB simulations in the short to medium term as part of an L2 milestone. Our conversion script could support both SWF and sacct.
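To make the SWF option concrete, here is a minimal sketch of reading SWF records into a small job record. The `TraceJob` class and `parse_swf` function are hypothetical and not part of this PR; the sketch relies on the published SWF layout of 18 whitespace-separated fields per job, with `;` starting header or comment lines and `-1` marking unknown values.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class TraceJob:
    job_id: int
    submit_time: float     # seconds since the start of the trace
    run_time: float        # seconds
    nprocs: int            # requested processors, else allocated
    requested_time: float  # seconds, -1 if unknown


def parse_swf(lines: Iterable[str]) -> Iterator[TraceJob]:
    """Yield one TraceJob per SWF data line, skipping comments and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        if len(fields) != 18:
            raise ValueError(f"expected 18 SWF fields, got {len(fields)}")
        allocated = int(fields[4])   # field 5: allocated processors
        requested = int(fields[7])   # field 8: requested processors
        yield TraceJob(
            job_id=int(fields[0]),            # field 1: job number
            submit_time=float(fields[1]),     # field 2: submit time
            run_time=float(fields[3]),        # field 4: run time
            nprocs=requested if requested > 0 else allocated,
            requested_time=float(fields[8]),  # field 9: requested time
        )
```

A Jobspec-native format would carry richer resource requests than this record does; the sketch only shows that an SWF reader in a conversion script can stay small.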
Yeah, that is the current strategy. I'm definitely open to changing it. One easy tweak to the current strategy could be to remove module loading/unloading from the Python and just have simulator-specific RC scripts that don't load the exec system. One of the benefits IMO of doing it from Python is that all of the simulator-specific information and logic (including the simulated clock) is localized to a single file. IIUC, a simulator-specific struct exec_implementation would require some form of side-channel communication between the simulator and exec system to communicate:
Maybe we can discuss in more detail at coffee time.
Yeah, I agree that this solution isn't the most appealing from a conceptual level. As we discussed face-to-face, let's move forward with the quiescent interface for now, and we can revisit later on once we have some more discussions and better ideas. For now, I think the big benefits of the quiescent interface are that it:
👍 I'll start work on that now.
This pull request introduces 5 alerts when merging ce68813 into ce510d3 - view on LGTM.com new alerts:
Per a face-to-face discussion with @garlick:
def insert_apriori_events(self, simulation):
    # TODO: add priority to `add_event` so that all submits for a given time
    # can happen consecutively, followed by the waits for the jobids
    simulation.add_event(self.submit_time, lambda: simulation.submit_job(self))
Astute observation from @mrwyattii: this logic should be contained in the simulator along with the other job event additions.
I originally planned to have the Job add all of its own events so that the Simulation could remain agnostic of the job's lifecycle (submit -> run -> complete). That way, adding new job states like depend, grow, and shrink would only require modifying the Job class. It would also make adding new entities like a Resource (e.g., node, filesystem) more modular; they would each handle their own event adding and the Simulation could remain ignorant of their lifecycles. But that is probably left for another day and a different PR.
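One possible way to address the TODO in the snippet under review is to give `add_event` a priority that breaks ties between events scheduled for the same simulated time. The sketch below is an assumption about how that could look, not the simulator's actual event queue; the `EventQueue` class and the `priority` parameter are hypothetical.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class EventQueue:
    """Runs callbacks ordered by (time, priority, insertion order)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, int, Callable[[], None]]] = []
        self._counter = itertools.count()  # keeps ordering stable on ties

    def add_event(self, time: float, callback: Callable[[], None], priority: int = 0) -> None:
        """Schedule callback at time; lower priority values run first."""
        heapq.heappush(self._heap, (time, priority, next(self._counter), callback))

    def run(self) -> None:
        """Pop and run events until the queue is empty."""
        while self._heap:
            _time, _priority, _seq, callback = heapq.heappop(self._heap)
            callback()
```

With submits at priority 0 and waits at priority 1, all submits for a given time run consecutively before any wait at that time, regardless of insertion order. The insertion counter also means callbacks are never compared with each other when time and priority are equal.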
Note from @mrwyattii's current research investigation. The current cancel method just raises the cancel exception. The simulator acting as the exec system does not actually process the cancel exception properly, so the job never makes it to the |
ce68813 to b2afbad
EDIT: I just force-pushed the commit (ce68813) that I had previously overwritten with an older commit.
b2afbad to ce68813
Hello, will this PR be accepted? Thank you.
@adfaure, this PR is quite outdated, so it won't be accepted in its current form, though I think the plan is to eventually update and merge this work.
It depends on what you mean by simulation. What are you looking to do? For example, the mainline version of flux-core can simulate job execution when the |
I am interested in understanding the simulation capabilities of Flux to get a global picture of what it offers, especially regarding scheduling simulation.
Last year, I managed to get the simulator of this PR working; I will try to do the same with the current master branch. Thank you for your quick answer.
@adfaure, I will let @SteVwonder answer some of your specific questions.
The scheduler in Flux is an independent module. To develop a new scheduling algorithm you can either write a new scheduler module (using perhaps the extremely simple included scheduler as a starting point) or develop new planner or matching plugins for the Fluxion graph-based scheduler. We should perhaps move the last few comments here to our Discussions forum. Edit: Done. See #3718
Placing the simulator in deep freeze. To be resurrected at some future date when civilization has evolved to a higher level of consciousness.
Initial support for the new simulator design within flux-core. It is a CLI tool that takes output files from `sacct` and re-executes the job trace through Flux using a simulated set of resources. Most of the logic is contained within `flux-simulator.py`, but there is some added logic in the job-manager and scheduler for determining "quiescence" (i.e., in the absence of new events/requests, the system will make no further changes, such as allocating or freeing jobs).
I can peel the Python bindings changes out into a separate PR if that is desirable (a few of the commits can be removed too once we close #2549).
Related: #1566